Model Fitting

Given that we can simply ask the simulation for more samples, I decided NOT to employ a typical train/test split. Rather, I use all samples for training and optimize based on cross-validation, giving every record a turn in both the training set and the test set. Once the model is fitted, we can then ask the simulation for more samples (say, 1000) to use as a completely independent test set.
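A minimal sketch of this scheme, assuming scikit-learn and hypothetical arrays `X`, `y` standing in for the simulated features and SAS/neutral labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical simulated data standing in for the real feature matrix/labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = rng.integers(0, 2, size=200)

# No held-out split: estimate performance by k-fold CV, so each record
# takes a turn in both the training folds and the held-out fold.
model = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores.mean())

# Final fit on ALL samples; freshly simulated samples then serve as the test set.
model.fit(X, y)
```

The CV scores estimate generalization; the final model still sees every sample before being evaluated on new simulation output.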

Model Evaluations

Many metrics can be used to evaluate models, some I calculate here are:

  1. Accuracy: (TP + TN) / total, the fraction of samples the RF model classifies correctly
  2. Error Rate: 1 - Accuracy, the fraction of samples the RF model classifies incorrectly
  3. True Positive Rate (TPR) | Sensitivity | Recall | Coverage: TP / (TP + FN), fraction of SAS examples correctly predicted
    • There is typically a trade-off between Recall and Precision (below)
  4. True Negative Rate (TNR) | Specificity: TN / (FP + TN), fraction of neutral examples correctly predicted
  5. False Positive Rate (FPR): FP / (FP + TN), fraction of neutral examples predicted as having SAS (really bad)
  6. False Negative Rate (FNR): FN / (TP + FN), fraction of SAS examples predicted as neutral (not as bad, but still bad)
  7. Precision: TP / (TP + FP), fraction of samples that actually have SAS out of total samples predicted to have SAS
    • Precision addresses the question: "Given a sample predicted to have SAS, how likely is it to be correct?"
    • We may want to sacrifice Recall in order to achieve high Precision
  8. F-measure: $\frac{2 \cdot precision \cdot recall}{precision + recall}$, the harmonic mean of precision and recall (a harmonic mean lies closer to the smaller of its inputs, so the F-measure is pulled toward whichever of precision and recall is smaller in magnitude)
    • Ideally, the F-measure should be high, indicating that both precision and recall are high
  9. Area Under the Curve (AUC): the area under the Receiver Operating Characteristic (ROC) curve, which plots TPR against FPR and demonstrates the sensitivity-specificity trade-off
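The metrics above all fall out of the confusion-matrix counts. A small sketch with made-up predictions (1 = SAS, 0 = neutral), assuming scikit-learn:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Illustrative labels, hard predictions, and P(SAS) scores (not real results).
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y_prob = np.array([0.9, 0.4, 0.8, 0.2, 0.6, 0.1, 0.7, 0.3])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy  = (tp + tn) / (tp + tn + fp + fn)  # → 0.75
tpr       = tp / (tp + fn)                   # recall / sensitivity
tnr       = tn / (fp + tn)                   # specificity
fpr       = fp / (fp + tn)
precision = tp / (tp + fp)
f_measure = 2 * precision * tpr / (precision + tpr)
auc       = roc_auc_score(y_true, y_prob)    # ranks P(SAS) across the two classes
print(accuracy, tpr, precision, f_measure, auc)
```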

Confusion Matrix

Alternative Metric: Cost Matrix

Sometimes, particularly in the health sciences, we want to punish or reward the model more heavily for some outcomes than for others. For example, in cancer prediction, more emphasis is placed on avoiding False Negatives (failing to detect the cancer), so we may wish to assign costs/weights (negative means reward) to TP, FP, TN, and FN like this:

This can be implemented as an alternative to the metrics above during cross-validation. Alternatively, a cost matrix can be used to classify one particular record; that is, we can use a cost matrix to evaluate risk.

With a RandomForest, I am able to extract the probability of a sample showing SAS or not, say:

P(SAS) = 0.2, P(neutral/other) = 0.8

Given the above cost matrix, then when I:
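As a sketch of this risk-based decision, with an assumed illustrative cost matrix (rows = true class, columns = predicted class, negative = reward), the class minimizing expected cost can be chosen from the RandomForest probabilities:

```python
import numpy as np

# Assumed costs, purely illustrative: cost[true, pred], class order (neutral, SAS).
# False negatives (true SAS predicted neutral) are punished hardest.
cost = np.array([[0.0, 1.0],     # true neutral: TN costs 0, FP costs 1
                 [10.0, -1.0]])  # true SAS: FN costs 10, TP is rewarded (-1)

# Class probabilities for one record, e.g. one row of model.predict_proba(X).
p = np.array([0.8, 0.2])  # P(neutral) = 0.8, P(SAS) = 0.2

# Expected cost of each possible prediction; pick the cheaper one.
expected = p @ cost        # expected[j] = sum_i P(class i) * cost[i, j]
decision = int(expected.argmin())
print(expected, decision)  # → [2.0, 0.6], predicts SAS (1)
```

Note that with FN penalized at 10, even P(SAS) = 0.2 is enough to tip the decision toward SAS, which is exactly the asymmetry the cost matrix is meant to encode.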

Test Set Evaluation

Below are performance measures on a test set never touched during model building.

Next steps:

Generating a diversity of inputs:

Transfer of knowledge:

Build the same model using alt_output (means)

Remaining issues:

  1. Presence of NaNs and Infs
    • 3210 samples have NaNs sporadically throughout all features
    • 731 samples have Infs, only in feature 7 (Tajima's D on the Y)
  2. How to generate a diverse, well-rounded dataset covering all scenarios possible?
    • Start playing with more complex cases, incorporating 0.5 vs 0.5 allele frequencies
  3. Fine-tuning the model to achieve better results: strong overfitting and poor generalizability
    • Reasons for a high generalization gap:
      • Different distributions: the validation and test sets might come from different distributions. Does the simulation process differ widely even if the inputs are of similar format?
      • Number of samples: probably not the issue given 6000+ training and test samples
      • Hyperparameter Overfitting: probably not the issue given low tree depth still causes low test accuracy
      • Bug in the code: incorrectly implemented cross-validation?
      • Inappropriate modeling: worst case, RandomForest is simply not capturing the signals of SAS well; consider using deep learning?
    • What metric do we want to use? (see modeling section for full list of possible metrics)
  4. Speeding up Scripts on TACC
    • 13-hour run time to generate 2000 samples on 1 node is too long
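For issue 1, one simple cleaning sketch (assuming NumPy and a hypothetical feature matrix) is to treat Infs as missing and then drop any row containing a missing value; imputation would be the alternative if dropping loses too many samples:

```python
import numpy as np

# Hypothetical feature matrix with a sporadic NaN and an Inf
# (e.g. Tajima's D undefined for some simulated Y chromosomes).
X = np.array([[1.0, np.inf],
              [np.nan, 2.0],
              [0.5, 1.5]])

# Treat Infs as missing, then keep only fully finite rows.
X = np.where(np.isinf(X), np.nan, X)
clean = X[~np.isnan(X).any(axis=1)]
print(clean)  # only the last row survives
```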